Tolerating Silent Data Corruption in Opaque Preconditioners
نویسندگان
چکیده
We demonstrate algorithm-based fault tolerance for silent, transient data corruption in “black-box” preconditioners. We consider both additive Schwarz domain decomposition with an ILU(k) subdomain solver, and algebraic multigrid, both implemented in the Trilinos library. We evaluate faults that corrupt preconditioner results in both single and multiple MPI ranks. We then analyze how our approach behaves when then application is scaled. Our technique is based on a Selective Reliability approach that performs most operations in an unreliable mode, with only a few operations performed reliably. We also investigate two responses to faults and discuss the performance overheads imposed by each. For a non-symmetric problem solved using GMRES and ILU, we show that at scale our fault tolerance approach incurs only 22% overhead for the worst case. With detection techniques, we are able to reduce this overhead to 1.8% in the worst case.
منابع مشابه
Realistic Failures in Secure Multi-party Computation
In secure multi-party computation, the different ways in which the adversary can control the corrupted players are described by different corruption types. The three most common corruption types are active corruption (the adversary has full control over the corrupted player), passive corruption (the adversary sees what the corrupted player sees) and fail-corruption (the adversary can force the ...
متن کاملELLIOTT III , JAMES JOHN . Resilient Iterative Linear Solvers Running Through Errors
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direction of Frank Mueller.) Future extreme-scale computer systems may expose incorrect behavior to applications, in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for reasoning about faults in applications. This ...
متن کاملdata corruption in the storage stack : a closer look
[email protected] o n e o f t h e b i g g e s t c h a l l e n g e s i n designing storage systems is providing the reliability and availability that users expect. A serious threat to reliability is silent data corruption (i.e., corruption not detected by the disk drive). In order to develop suitable protection mechanisms against corruption, it is essential to understand its characteristics. In ...
متن کاملHARDFS: hardening HDFS with selective and lightweight versioning
We harden the Hadoop Distributed File System (HDFS) against fail-silent (non fail-stop) behaviors that result from memory corruption and software bugs using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight,...
متن کاملTolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer Exploration for Architecting Resilience)
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1404.5552 شماره
صفحات -
تاریخ انتشار 2014